Data Description

The “Spotify: All Time Top 2000 Mega Dataset” is a Kaggle dataset that contains audio statistics and ratings for the top 1,994 songs on Spotify. For each song, it includes the Title, Artist, Year of release, Top Genre, BPM (beats per minute), and Duration. It also includes ratings that measure Energy, Danceability, Loudness, Liveness, Valence, Acousticness, Speechiness, and Popularity. We manually added Genre, Decade, and Decade Range columns so that we could group songs into fewer categories, which makes our analyses clearer.
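The derived columns can be illustrated with a short sketch. The report does not show how the columns were built, so everything here is an assumption: the function names are hypothetical, the 1990 cut-off is inferred from the two eras named later in the report (1950s-1980s and 1990s-2010s), and the keyword rules for the broad Genre column (including "pop" as a fifth genre) are made up for illustration.

```python
# Keyword rules for the broad Genre column. These are hypothetical: the
# report does not list its groupings, and "pop" is an assumed fifth genre.
# Order matters: "indie rock" should map to "indie", not "rock".
GENRE_KEYWORDS = ["hip hop", "country", "indie", "rock", "pop"]


def decade_of(year):
    """Label a release year with its decade, e.g. 1987 -> '1980s'."""
    return f"{(year // 10) * 10}s"


def decade_range_of(year):
    """Collapse decades into two eras (assumed cut-off: 1990)."""
    return "1950s-1980s" if year < 1990 else "1990s-2010s"


def broad_genre(top_genre):
    """Map a detailed Top Genre label to a broad genre, or 'other'."""
    lowered = top_genre.lower()
    for keyword in GENRE_KEYWORDS:
        if keyword in lowered:
            return keyword
    return "other"


# Illustrative rows only, not taken from the dataset.
songs = [
    {"title": "Song A", "year": 1975, "top_genre": "album rock"},
    {"title": "Song B", "year": 1998, "top_genre": "dutch indie"},
    {"title": "Song C", "year": 2015, "top_genre": "adult standards"},
]

for song in songs:
    song["decade"] = decade_of(song["year"])
    song["decade_range"] = decade_range_of(song["year"])
    song["genre"] = broad_genre(song["top_genre"])
    print(song)
```

Keyword matching is only one possible approach; a hand-written lookup table from each Top Genre value to a broad genre would give more control over edge cases.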



Research Questions

Using our dataset, we would like to answer three main questions:



Graphs

Question 1: Genres

In order to answer the first research question, we ______.

Graph 1

Of the 10 quantitative variables in this dataset, only Energy, Danceability, and Popularity appear to show a clear difference between the overall genres. (For the graphs above, we removed the “other” genre, since it encompasses too many miscellaneous songs and does not provide useful information.) Among those three variables, country songs tend to have lower Energy than songs of other genres, hip hop songs tend to have higher Danceability, and indie songs tend to have lower Popularity. Note, however, that our sample sizes for the hip hop, indie, and country genres are relatively small, so these conclusions may not be reliable.

Graph 2

This dendrogram, built from the three variables described above, appears to show very little difference between the five genres. Each of the clusters appears to be somewhat similarly distributed. Also, a majority of the songs are categorized as rock, as our EDA shows.

Graph 3

Again, this graph appears to show very little difference between the genres. There appears to be just one cluster, containing all of the genres somewhat uniformly. This also confirms that we have more rock songs than anything else in this dataset.


Question 2: Popularity

Graph 1

Graph 2

Graph 3


Question 3: Time

The third research question concerns time trends. Therefore, we explored various song attributes in the context of the Year, Decade, or Decade Range in which each song was released.

Graph 1

Since we have so many quantitative variables, we first conducted principal component analysis (PCA). We then plotted the first two components and colored the data points by the Decade Range variable so that we could make comparisons over time without cluttering the graph with too many overlapping colors.

We can see that the Decade Range groups separate slightly along the first two components, since there are mostly blue data points at the top and mostly red data points at the bottom. One example of a conclusion that can be drawn from the graph is that as BPM increases, PC1 decreases and PC2 increases; furthermore, songs from the 1990s-2010s tend to have more beats per minute than songs from the 1950s-1980s. That said, since the normal-distribution ellipses overlap quite a bit, there is not enough evidence to conclude that the two groups differ significantly with respect to their principal components.
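To make the PCA step concrete, here is a minimal, dependency-free sketch of how the first two components could be computed. It is not the code behind the graph, which the report does not show. It assumes the variables are standardized first (reasonable when columns such as BPM and Popularity are on different scales) and uses power iteration with deflation, a swapped-in technique chosen for brevity. The function names and the toy rows are hypothetical.

```python
import math
import random


def standardize(rows):
    """Scale each column to mean 0 and sample standard deviation 1."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def covariance(z):
    """Covariance matrix of already-centered rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_eigenpair(matrix, iterations=500):
    """Power iteration: multiply by the matrix and renormalize repeatedly."""
    p = len(matrix)
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    v = [rng.random() + 0.1 for _ in range(p)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return eigenvalue, v


def first_two_components(rows):
    """Return (variance, loadings) for PC1 and PC2, plus per-row scores."""
    z = standardize(rows)
    c = covariance(z)
    val1, vec1 = top_eigenpair(c)
    p = len(c)
    # Deflation: remove PC1's contribution, then find the next direction.
    deflated = [[c[i][j] - val1 * vec1[i] * vec1[j] for j in range(p)]
                for i in range(p)]
    val2, vec2 = top_eigenpair(deflated)
    scores = [(sum(a * b for a, b in zip(r, vec1)),
               sum(a * b for a, b in zip(r, vec2))) for r in z]
    return (val1, vec1), (val2, vec2), scores


# Toy rows (hypothetical BPM, Energy, Danceability values), not real data.
rows = [[120, 80, 60], [95, 40, 45], [140, 90, 70],
        [100, 55, 50], [130, 70, 75], [85, 30, 35]]
eras = ["1990s-2010s", "1950s-1980s", "1990s-2010s",
        "1950s-1980s", "1990s-2010s", "1950s-1980s"]

(val1, vec1), (val2, vec2), scores = first_two_components(rows)
for era, (pc1, pc2) in zip(eras, scores):
    print(f"{era}: PC1={pc1:.2f}, PC2={pc2:.2f}")
```

The sign of each component is arbitrary, so a statement such as "PC1 decreases as BPM increases" depends on the orientation a particular run or library happens to return.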

Graph 2

To address some of our qualitative variables, we made a comparison word cloud of the top genres of songs from the 1950s-1980s and from the 1990s-2010s, to provide insight into how the top genres have changed, if at all, between these two eras.

From this word cloud, a few genre terms appear almost exclusively in the 1950s-1980s, such as “adult standards,” “classic rock,” “album,” and “europop.” Meanwhile, “alternative,” “dutch,” “modern,” and “pop” seem to be more popular in the 1990s-2010s. Some terms, such as “dance,” are more common in the 1990s-2010s, but because “dance” is drawn small, it appeared in the 1950s-1980s almost as frequently.
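The reading of word size can be made concrete with a small sketch. It assumes the cloud places each word in the era where its share of genre words is highest and sizes it by how far that share exceeds the word's average share across both eras; the report does not state how the cloud was built, so this sizing rule is an assumption. The function name and the toy word lists are hypothetical.

```python
from collections import Counter


def comparison_sizes(words_by_era):
    """For each word, return (era with the highest share, excess share).

    Excess share = the word's share in that era minus its mean share
    across all eras. A small excess means the word is almost equally
    common everywhere.
    """
    shares = {}
    for era, words in words_by_era.items():
        counts = Counter(words)
        total = sum(counts.values())
        shares[era] = {word: n / total for word, n in counts.items()}

    vocabulary = set()
    for era_shares in shares.values():
        vocabulary.update(era_shares)

    result = {}
    for word in vocabulary:
        per_era = {era: shares[era].get(word, 0.0) for era in shares}
        mean_share = sum(per_era.values()) / len(per_era)
        best_era = max(per_era, key=per_era.get)
        result[word] = (best_era, per_era[best_era] - mean_share)
    return result


# Toy genre words (hypothetical counts, not taken from the dataset).
words_by_era = {
    "1950s-1980s": ["classic"] * 4 + ["rock"] * 4 + ["dance"] * 2,
    "1990s-2010s": ["pop"] * 5 + ["dance"] * 3 + ["rock"] * 2,
}

sizes = comparison_sizes(words_by_era)
for word, (era, excess) in sorted(sizes.items(), key=lambda kv: -kv[1][1]):
    print(f"{word:8s} {era}  excess share = {excess:.2f}")
```

In this toy example "dance" lands in the 1990s-2010s group with a small excess share, because it makes up a similar fraction of both eras; that is the situation described for “dance” above.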

Graph 3

To more closely monitor how a single attribute has changed over time, we constructed a decomposed time series plot of Danceability.

The global trend can be seen in the second facet, which shows that Danceability was low until the 1970s, reached a peak around 1980, decreased through the late 1980s and 1990s, and has been steadily increasing since 2000. The seasonal component can be seen in the third facet. The up-and-down pattern of this plot suggests that Danceability moves in cycles that last about 5 years. Perhaps the main takeaway from this graph is therefore that Danceability comes in waves not only over the span of decades but also over spans of a few years.
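The split into trend, seasonal, and remainder facets can be sketched with a classical additive decomposition. This is a minimal illustration under stated assumptions, not the method behind the plot: the report does not show its decomposition method or settings, the period of 5 is taken from the cycle length described above, and the series below is synthetic rather than real yearly Danceability values. The function name is hypothetical.

```python
def decompose_additive(values, period):
    """Classical additive decomposition for an odd period.

    trend     : centered moving average over one full period
    seasonal  : mean detrended value at each position in the cycle,
                shifted so the cycle averages to zero
    remainder : what is left after removing trend and seasonal
    Edge positions without a full window get None for trend/remainder.
    """
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    n = len(values)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2

    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(values[t - half:t + half + 1]) / period

    # Collect detrended values by their position within the cycle.
    by_phase = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_phase[t % period].append(values[t] - trend[t])
    phase_means = [sum(v) / len(v) for v in by_phase]
    center = sum(phase_means) / period
    seasonal = [phase_means[t % period] - center for t in range(n)]

    remainder = [None if trend[t] is None
                 else values[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return trend, seasonal, remainder


# Synthetic series: a linear trend plus a repeating 5-step pattern.
pattern = [2.0, -1.0, 0.0, -2.0, 1.0]
series = [50 + 0.5 * t + pattern[t % 5] for t in range(20)]

trend, seasonal, remainder = decompose_additive(series, period=5)
print("trend at t=10:", trend[10])
print("seasonal cycle:", [round(s, 2) for s in seasonal[:5]])
```

Because a decomposition with a fixed period will always return a repeating seasonal component of that length, the period should be chosen from the data or subject knowledge before the seasonal facet is read as evidence of a cycle.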



Conclusions

Finally, we can conclude that .